Hashing¶
Remember Binary Search?¶
Binary Search Analysis¶
$$ T(n) = \left\{ \begin{array}\\ c & \mbox{if } \ n = 1 \\ c+T(n/2) & \mbox{if } \ n > 1 \\ \end{array} \right.$$
$T(n) = c + T(n/2)) \implies T(n) = c + [c + T(n/4)] \implies T(n) = 2c + T(n/4)$
$T(n) = 2c + [c + T(n/8)] \implies T(n) = 3c + T(n/8)$
$T(n) = 3c + T(n/2^3)$
$T(n) = 3c + [c + T(n/2^4)] \implies T(n) = 4c + T(n/2^4)$
Searching based on Comparisons¶
Searching based on comparisons¶
$N$ elements have to be searched.
That requires how many pairwise comparisons?
Depth of the decision structure is...?
- Level 0: 1
- Level 1: 2
- Level 2: 4
- Level 3: 8
Searching based on comparisons¶
At level $k$, I have $N$ possibilities
$\implies 2^k = N$
$\implies k = \log_2(N)$
$k \approx (\log N)$
Suppose we have a set of products with product IDs: $[1100,\ldots,1199] $
We have an array $A[0] - A[99]$.
- $A[0] : 1100$
- $A[1] : 1101$
- $A[2] : 1102$
- ...
- $A[99]: 1199$
How did we get here?¶
$A[0] : 1100$
$A[1] : 1101$
$A[2] : 1102$
...
$A[99]: 1199$
1101%1100 = 1
1102%1100 = 2
1127%1100 = 27
Hashing¶
- The keys are stored in an array called a hash table
- A hash function maps the search keys to specific entries in the table.
Hashing¶
Keys: 765, 431, 96, 142, 579, 226, 903, 388
Hash Table T contains M = 13 elements
Hash Function $h$ (key) = key % $M$
keys=[765, 431, 96, 142, 579, 226, 903, 388]
M = 13
for key in keys:
print(key,"-->",key%M)
765 --> 11 431 --> 2 96 --> 5 142 --> 12 579 --> 7 226 --> 5 903 --> 6 388 --> 11
96 --> 5
226 --> 5
Entry [5] already contains 96
This is called a Collision
Collision: Linear Probing¶
Linear Probing¶
- We must resolve the collision by..
- probing the table to find another available slot.
- using a linear probe, which
- examines the table entries in sequential order
- starting with the first entry immediately following the original hash location.
- While probing, if the end of the array is reached, wrap around.
Searching¶
Same as insert.
- Find the hash value
- Start a linear search from there on, till we encounter the number or a null (empty) value.
Deletions¶
Search the key to be deleted.
Simply remove it by setting the corresponding table entry to None.
This does not work. Why?
Have an indicator that this entry was deleted.¶
Clustering¶
As more keys are added to the hash table, more collisions are likely to occur.
Since each collision requires a linear probe to find the next available slot, the keys begin to form clusters.
As the clusters grow larger, so too does the probability that the next key added to the table will result in a collision.
Clustering¶
- Empty Table: Each slot has a probability of 1/13 $(1/M)$.
What is the probability the next key will occupy the empty slot at position 4?
What is the probability the next key will occupy the empty slot at position 9?
Clustering¶
- Each empty slot has a probability of 1/13 $(1/M)$.
What is the probability the next key will occupy the empty slot at position 4?
- 1/13
What is the probability the next key will occupy the empty slot at position 9?
- 5/13
Clustering¶
This type of clustering is known as primary clustering since it occurs near the original hash position.
As the clusters grow larger, so too does the length of the search needed to find the next available slot.
Clustering: Modified Linear Probe¶
When probing to find the next available slot, a loop is used to iterate through the table entries.
The order in which the entries are visited form a probe sequence.
- slot = ( home + $i * c$ ) % $M$, where $i=0,1,2,…$
- where $i$ is the $i^{th}$ probe in the sequence
- home is the original position $(i = 0)$:
home = (key % $M$)
- slot = ( home + $i * c$ ) % $M$, where $i=0,1,2,…$
How do we choose c?¶
Clustering: Modified Linear Probe¶
$c$ and $M$ must be coprime
- GCD$(c,M) = 1 $ (Greatest Common Divisor / Highest Common Factor)
- Why?
If $M$ is prime, then $c$ can take any value.
If $M=13$ and $c=2$, and a key hashes to 2, the visit order is:
- 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11, 0
Clustering: Modified Linear Probe¶
$c$ and $M$ must be coprime
- GCD$(c,M) = 1 $ (Greatest Common Divisor / Highest Common Factor)
- Why?
If $M$ is prime, then $c$ can take any value.
If $M=13$ and $c=2$, and a key hashes to 2, the visit order is:
- 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11, 0
If $M=13$ and $c=3$, the visit order is:
- 5, 8, 11, 1, 4, 7, 10, 0, 3, 6, 9, 12
Clustering: Modified Linear Probe¶
$c$ and $M$ must be coprime
- GCD$(c,M) = 1 $ (Greatest Common Divisor / Highest Common Factor)
- Why?
If $M$ is prime, then $c$ can take any value.
If $M=13$ and $c=2$, and a key hashes to 2, the visit order is:
- 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11, 0
If $M=13$ and $c=3$, the visit order is:
- 5, 8, 11, 1, 4, 7, 10, 0, 3, 6, 9, 12
If $M=10$ and $c=2$, the visit order is:
- 4, 6, 8, 0, 2, 4, 6, 8, 0
Quadratic Probing¶
slot = ( home + $i^2 $) % $M$
Quadratic probing eliminates primary clustering by increasing the distance between each probe in the sequence.
Introduces the problem of secondary clustering
- when two keys map to the same table entry and have the same probe sequence.
- for example, 388 and 258 (both hash to slot 11).
No guarantee the quadratic probe will visit every entry in the table.
If $M$ is prime, then at least half of them will be visited.
Double Hashing¶
The quadratic probe distributes the keys by increasing steps in the probe sequence.
But the same sequence is followed by multiple keys that map to the same table entry (secondary clustering)
This occurs because the probe equation is based solely on the original hash slot.
In double hashing, when a collision occurs, the key is hashed by a second function and the result is used as the constant factor in the linear probe:
- slot = ( home + $i * hp\ (key))\ \ $ % $M$
Double Hashing¶
In double hashing, when a collision occurs, the key is hashed by a second function and the result is used as the constant factor in the linear probe:
- slot = ( home + $i * hp\ (key) )\ \ $ % $M$
While the step size remains constant throughout the probe,
- multiple keys that map to the same table entry will have different probe sequences.
Double Hashing¶
To reduce clustering, the second hash function $hp(\cdot)$ should not be the same as the main hash function:
- $hp(\cdot)\not= h(\cdot)$
It should produce a valid index in the range $0 < c < M$.
A simple choice for the second hash function takes the form:
- $hp(key) = 1 + key\ \% P$
- $P < M$
Double Hashing¶
In double hashing, when a collision occurs, the key is hashed by a second function and the result is used as the constant factor in the linear probe:
- slot = [ home + $(i * (1 + key\%P)) ]\ \ \% M$
While the step size remains constant throughout the probe,
- multiple keys that map to the same table entry will have different probe sequences.
Double Hashing¶
Used to resolve collisions since it reduces both primary and secondary clustering.
To ensure every table entry is visited during the probing, the table size must be a prime number.
Rehashing¶
How to decide how big a hash table should be?
After some time, we may need to make room for more entries
We need a new hash table
Cannot simply copy -- we have to rebuild or rehash the entire table
Experience has shown that hashing works best when the table is no more than $\approx \frac{3}{4}$ full.
if the hash table is to be expanded, it should be done before the table becomes full.
The ratio between the number of keys in the hash table and the size of the table is called the load factor.
In practice, a hash table should be expanded before the load factor reaches 80%.
Rule of Thumb: Double the table size $(2M + 1)$.
Separate Chaining¶
- We can eliminate collisions if we use linked lists.
- The linked lists are commonly referred to as chains
- This technique of collision resolution is known as separate chaining.
Separate Chaining¶
Hash table is constructed as an array of linked lists.
New keys can be prepended to the linked list since the nodes are in no particular order.
Deleting is straight forward
Separate Chaining¶
Also known as open hashing since the keys are stored outside the table (not in the array).
- closed addressing to describe open hashing, and
- open addressing is used to describe closed hashing
The "addressing" term refers to the possible locations of the keys in relation to the table entries.
- In open addressing, the keys may have been stored in an open slot different from the one to which it originally mapped
- while in closed addressing, the key is contained within the entry to which it mapped.
Hash Functions¶
A “perfect” hash function will map every key to a different table entry, resulting in no collisions.
- Very hard to achieve, possible when:
- keys are within a small range
- keys are known beforehand
- Very hard to achieve, possible when:
Goal: design a hash function that will distribute the keys across the range as evenly as possible.
Hash Functions - Desirable Properties¶
Easy to compute
Resulting index cannot be changing.
- When applied multiple times to the same key, it must always return the same index value
If the key consists of multiple parts, every part should contribute in the computation of the resulting index value.
The table size should be a prime number, especially with the '%' operator.
Hash Functions¶
Division (Remainder)
- $h(key) = key\ \%\ M$
Truncation
- For very large numbers, some digits can be ignored.
- example: 4873152 (7 digit number) in 0 - 999,
- we can use location 1(8), 3(3) and 6(2), to get 832.
- For very large numbers, some digits can be ignored.
Hash Functions...¶
- Folding
- key is split into multiple parts.
- combined into a single value by adding or multiplying the parts.
- example: 4873152 ; split into 48, 731, and 52 ; add them up.
- 48 + 731 + 52 = 831 (location)
String Hashing¶
Convert string representation to integer
Simplest approach is to sum the ASCII values of the individual characters.
- works well with small hash tables.
- pneumonoultramicroscopicsilicovolcanokoniosis
- The total value of this is: 4888
Another approach is to use a polynomial:
$s_0a^{n-1} + s_1a^{n-2} + \ldots + s_{n-3}a^2 + s_{n-2}a + s_{n-1}$
- $a$ is non-zero
- $s_i$ is the ASCII value of the $i^{th}$ character in the string
- $n$ is the length of the string
works well even if the strings are short